Cache optimization #2339
Conversation
Pull request overview
This PR introduces significant performance optimizations to the MultifrontalSolver, focusing on cache locality, memory efficiency, and parallel execution granularity. The changes implement QR factorization for leaf cliques, a fused load-and-eliminate path, cached factorization results, and improved memory management, delivering substantial speedups on larger datasets while introducing architectural improvements for precomputed symbolic data.
Key changes:
- Implements QR elimination mode for high aspect-ratio leaf cliques without prior Hessian factors
- Introduces a fused eliminateInPlace(graph) that interleaves loading and elimination in a single traversal (see the usage sketch after this list)
- Caches factorization results in a separate RSd_ matrix for optimized back-substitution
- Adds precomputed symbolic data support via a PrecomputedData struct and Precompute() method
- Increases the parallel task threshold from 10 to 4096 to reduce scheduling overhead for small cliques
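As a reference point, here is a minimal usage sketch of the new API surface. Only the names Precompute(), PrecomputedData, the precomputed-data constructor overload, and eliminateInPlace(graph) are taken from this PR's summary; headers, argument lists, and the final solve step are assumptions for illustration.

```cpp
// Hedged sketch, not the PR's exact code: only Precompute(), PrecomputedData,
// the precomputed-data constructor, and eliminateInPlace(graph) are named in
// this PR; argument lists and headers are assumptions.
#include <gtsam/inference/Ordering.h>
#include <gtsam/linear/GaussianFactorGraph.h>
#include <gtsam/linear/MultifrontalSolver.h>

void solveWithFusedPath(const gtsam::GaussianFactorGraph& graph,
                        const gtsam::Ordering& ordering) {
  using gtsam::MultifrontalSolver;

  // Run the symbolic analysis once; reuse it whenever the sparsity pattern
  // stays fixed, e.g. across Levenberg-Marquardt iterations.
  MultifrontalSolver::PrecomputedData symbolic =
      MultifrontalSolver::Precompute(graph, ordering);

  // New constructor overload that consumes the precomputed symbolic data.
  MultifrontalSolver solver(symbolic);

  // Fused load-and-eliminate: interleaves loading factor blocks and
  // eliminating cliques in a single traversal, instead of a separate
  // load(graph) followed by eliminateInPlace().
  solver.eliminateInPlace(graph);

  // Back-substitution then works off the cached R, S, d blocks (RSd_);
  // the retrieval call is omitted since its name is not given in the summary.
}
```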
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| gtsam/linear/MultifrontalSolver.h | Adds PrecomputedData struct, new constructor overload, Precompute() static method, eliminateInPlace(graph) overload, and state tracking flags |
| gtsam/linear/MultifrontalSolver.cpp | Refactors constructor to use precomputed data, implements fused elimination path, adds row counting for VBM sizing, updates parallel thresholds |
| gtsam/linear/MultifrontalClique.h | Adds QR mode, damping methods, RSd_ caching, prepareForElimination/factorize separation, IndexedSymbolicFactor row tracking |
| gtsam/linear/MultifrontalClique.cpp | Implements lazy SBM allocation, QR leaf factorization, damping support, cached factorization, separator-only SBM for QR updates |
| gtsam/base/SymmetricBlockMatrix.h | Adds utility methods for diagonal damping (addToDiagonalBlock, addScaledIdentity); see the sketch after this table |
| gtsam/linear/tests/testMultifrontalSolver.cpp | Updates tests to explicitly call load(), adds test for new eliminateInPlace(graph) API |
| timing/timeSFMBAL.h | Adds createOrderings helper function, minor formatting fixes |
| timing/timeMultifrontalSolver.cpp | Refactors benchmarks into separate functions, uses fused elimination path, adds BAL135 test |
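A minimal sketch of the SymmetricBlockMatrix damping helpers mentioned above; the method names addToDiagonalBlock and addScaledIdentity come from this PR, while the signatures shown are assumptions.

```cpp
// Hedged sketch: method names are from the PR summary, signatures assumed.
#include <Eigen/Core>
#include <gtsam/base/SymmetricBlockMatrix.h>

// Levenberg-Marquardt-style damping of a clique's Hessian: H <- H + lambda*I,
// applied across the whole block matrix in one call.
void dampUniform(gtsam::SymmetricBlockMatrix& info, double lambda) {
  info.addScaledIdentity(lambda);
}

// Per-variable damping: add an increment to a single diagonal block only.
void dampBlock(gtsam::SymmetricBlockMatrix& info, gtsam::DenseIndex block,
               const Eigen::MatrixXd& increment) {
  info.addToDiagonalBlock(block, increment);
}
```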
Linux Timing (TBB)
Similar conclusions: the chain improvement is massive, BAL less so, and speedups against GTSAM are less pronounced than on the Mac, but this might just be because the Linux box is a 20-core machine with 32 GB RAM and a nice cache.
Analysis
(After / Before figures omitted.)
Unfortunately, without TBB on Linux we get worse-than-GTSAM results for BAL datasets, which stumps me. The overall picture seems to be:
Chains:
BAL:
Some detailed profiling on Linux, single-threaded, might at least bring that on par.
@ProfFan the
PS: for chains, even single-threaded Linux is 4-6 times faster; it's really something about the massive fan-in for BAL cliques.
ProfFan left a comment:
I'll do more profiling with this PR merged
Overview
This PR introduces significant performance improvements and architectural updates to the MultifrontalSolver, focusing on cache locality, memory efficiency, and parallel granularity.
Key Optimizations:
QR Elimination for Leaves: leaf cliques with high aspect ratio and no prior Hessian factors are eliminated with QR instead of forming the Hessian (a conceptual Eigen sketch follows this list).
Cache Locality & "Fused" Path: eliminateInPlace(graph) interleaves loading and elimination in a single traversal.
Improved Memory Management: SBM allocation is lazy, and QR updates use separator-only SBMs.
Cached Factorization (RSd_): the solver now explicitly caches the elimination result for optimized back-substitution.
Parallel Task Tuning: the parallel task threshold is raised from 10 to 4096 to cut scheduling overhead on small cliques (a generic TBB sketch of this pattern follows the Architectural Changes list).
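To make the QR-for-leaves point concrete, here is a plain-Eigen illustration (not the PR's implementation) of the two elimination modes for a tall leaf Jacobian A: the Cholesky route squares A into an explicit Hessian first, while QR produces an equivalent upper-triangular factor (up to signs) directly from A, avoiding the squaring pass and keeping the conditioning of A itself.

```cpp
// Conceptual illustration only, not the PR's code. Assumes a tall A (rows >= cols).
#include <Eigen/Dense>

// Hessian route: form H = A^T A (an m x n -> n x n squaring pass),
// then take R from the Cholesky factorization H = R^T R.
Eigen::MatrixXd leafCholeskyR(const Eigen::MatrixXd& A) {
  Eigen::MatrixXd H = A.transpose() * A;
  Eigen::MatrixXd R = H.llt().matrixU();
  return R;
}

// QR route: R comes straight from A = Q R; no explicit Hessian is formed.
Eigen::MatrixXd leafQrR(const Eigen::MatrixXd& A) {
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);
  Eigen::MatrixXd R = qr.matrixQR()
                          .topRows(A.cols())
                          .triangularView<Eigen::Upper>();
  return R;
}
```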
Architectural Changes
Precomputed Symbolic Structure: a PrecomputedData struct and a static Precompute() method allow the symbolic analysis to be done once and reused across solves.
Damping Support: MultifrontalClique gains damping methods, backed by new SymmetricBlockMatrix utilities (addToDiagonalBlock, addScaledIdentity).
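For the parallel task tuning item above, the threshold change can be pictured with a generic TBB gating pattern; this is not the PR's code, and exactly which quantity the solver compares against 4096 is an assumption here.

```cpp
// Generic TBB pattern, not the PR's implementation: small work units run
// inline, and only work above the threshold is spawned as a task. Raising the
// threshold from 10 to 4096 trades parallelism on tiny cliques for much
// lower scheduling overhead.
#include <cstddef>
#include <tbb/task_group.h>

constexpr std::size_t kParallelThreshold = 4096;  // previously 10

template <typename Clique, typename Work>
void processClique(const Clique& clique, Work&& work, tbb::task_group& group) {
  if (clique.size() < kParallelThreshold) {
    work(clique);                                   // run inline, no task
  } else {
    group.run([&clique, &work] { work(clique); });  // spawn a TBB task
  }
}
```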
Timing and Analysis
Below are the timings on a Mac, with TBB enabled (M1 MacBook with only 16 GB RAM).
The optimizations significantly improve scalability, at the cost of a small regression on very small problems.
Chain benchmarks
BAL benchmarks
Timings after the MultifrontalSolver optimizations, and before (After / Before figures omitted).